Abstract. In nature the three-dimensional structure of a protein is encoded
in the corresponding gene. In this paper we describe a new method for encoding
the three-dimensional structure of a protein into a binary sequence.
The distinctive feature of the method is the correspondence between protein
folding and "integration". A protein is approximated by a folded tetrahedron
sequence, and the binary code of the protein is obtained as the "second
derivative" of the shape of the folded tetrahedron sequence. With this method
we can extract static structural information of a protein from its gene, and
we can describe the distribution of three-dimensional structures of proteins
without any subjective hierarchical classification.

1. Overview

In nature the three-dimensional structure of a protein is encoded in the
corresponding gene. In this paper we describe a new method for encoding the
three-dimensional structure of a protein into a binary sequence (Fig. 1).

In the method a protein is approximated by a tetrahedron sequence. For
example, the approximation in Fig. 1(b) is obtained by folding the tetrahedron
sequence of Fig. 1(c), where three tetrahedrons are assigned to each amino
acid. A more precise approximation is obtained by using more tetrahedrons.

The distinctive feature of the method is the correspondence between protein
folding and "integration". The binary sequence is obtained as the "second
derivative" of the shape of the folded tetrahedron sequence.

With this method at hand, we can extract static structural information of a 
protein from its gene. And we can describe the distribution of three-dimensional 
structures of proteins without any subjective hierarchical classification. 

2. Basic idea: encoding of two-dimensional objects 

For simplicity we shall explain the basic idea behind the paper in the case of 
two-dimensional objects, where we use triangle sequences for approximation. 

2.1. Triangle sequence. Consider a unit cube in the three-dimensional Euclidean
space $\mathbb{R}^3$ whose vertices are given by $v_1$, $v_x$, $v_y$, $v_z$, ...,
and $v_{xyz}$, where $v_{x^l y^m z^n} := (l, m, n) \in \mathbb{Z}^3$ (Fig. 2(a)).
Draw the lines $v_1 v_{xy}$, $v_1 v_{yz}$ and $v_1 v_{xz}$. Then each of the
three upper faces is divided into two slant-triangle-tiles. For example,
$v_1 v_x v_{xy} v_y$ is divided into the two slant-tiles $v_1 v_x v_{xy}$ and
$v_1 v_y v_{xy}$.

First, by piling up these unit cubes in the direction from $v_{xyz}$ to $v_1$,
we obtain "peaks and valleys" with a "drawing" on it. The drawing is uniquely
determined by its peaks and divides the surface into a collection of
slant-triangle-tile sequences. For example, the drawing of Fig. 2(b) is
determined by two peaks, left $(2, 0, -1)$ and right $(0, 0, 0)$, and we denote
the drawing by $\mathrm{Cone}^*\{x^2/z, 1\}$.
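
The correspondence between a peak given as an integer point and its monomial label can be illustrated with a minimal sketch. The function name `monomial` and the ASCII output format (`^` for exponents) are assumptions made for this illustration only.

```python
def monomial(exponents, names="xyz"):
    """Render an integer exponent vector as a monomial label such as 'x^2/z'."""
    num, den = [], []
    for name, e in zip(names, exponents):
        if e == 0:
            continue
        factor = name if abs(e) == 1 else f"{name}^{abs(e)}"
        (num if e > 0 else den).append(factor)
    top = "".join(num) or "1"
    if not den:
        return top
    bottom = "".join(den)
    # Parenthesize the denominator only when it has several factors.
    return f"{top}/{bottom}" if len(den) == 1 else f"{top}/({bottom})"


# The two peaks of the drawing discussed above.
peaks = [(2, 0, -1), (0, 0, 0)]
label = "Cone*{" + ", ".join(monomial(p) for p in peaks) + "}"
print(label)
```

With the two peaks quoted in the text, the sketch reproduces the label Cone*{x^2/z, 1}.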




Figure 1. Overview. (a): Schematic diagram of 2HIU chain A
(Insulin, human). (b): Approximation of 2HIU by a tetrahedron
sequence. (c): The U/D sequence of approximation (b). (d): The
amino-acid sequence of 2HIU. (Panel (a) was prepared using
WebLab Viewer (Molecular Simulations Inc.).)

Figure 2. Basic idea. (a): Unit cube in $\mathbb{R}^3$ and its projection
on $H$. (b): Slant-tile sequences and flat-tile sequences defined by
$\mathrm{Cone}^*\{x^2/z, 1\}$. (c): Slant-tiles over a flat-tile on $H$.



Second, by the projection onto the hypersurface
$H := \{(a, b, c) \in \mathbb{R}^3 \mid a + b + c = 0\}$, we obtain a division of
$H$ into a collection of flat-triangle-tile sequences. For example, the gray
slant-tile sequence is projected onto the gray flat-tile sequence on $H$ in
Fig. 2(b). We write $a[uv]$ for the slant-tile $v_a v_{au} v_{auv}$ and
$|a[uv]|$ for the corresponding flat-tile. For example, $1[xy]$ stands for
$v_1 v_x v_{xy}$. Note that there are three types of slant-tiles over a
flat-tile (Fig. 2(c)). We shall see in the appendix that "peaks and valleys"
specify a "discrete vector field" of flat-tiles on $H$.
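
The statement that several slant-tiles lie over one flat-tile can be checked numerically. The sketch below projects slant-tiles orthogonally onto $H$ and compares the results. The three tile names 1[xy], x[yz] and xy[zx] are obtained by applying the shift operator of the appendix to 1[xy]; they are an assumption of this illustration and are not listed explicitly in this section. The helper names are hypothetical.

```python
from fractions import Fraction


def project(v):
    """Orthogonal projection of an integer point onto H = {a + b + c = 0}."""
    mean = Fraction(sum(v), len(v))
    return tuple(Fraction(c) - mean for c in v)


def slant_tile(a, order):
    """Vertices v_a, v_au, v_auv of the slant-tile a[uv] (axes 0, 1, 2 = x, y, z)."""
    verts = [tuple(a)]
    for axis in order:
        nxt = list(verts[-1])
        nxt[axis] += 1
        verts.append(tuple(nxt))
    return verts


def flat_tile(a, order):
    """The flat-tile |a[uv]|, represented by its set of projected vertices."""
    return frozenset(project(v) for v in slant_tile(a, order))


X, Y, Z = 0, 1, 2
t1 = flat_tile((0, 0, 0), (X, Y))  # |1[xy]|
t2 = flat_tile((1, 0, 0), (Y, Z))  # |x[yz]|
t3 = flat_tile((1, 1, 0), (Z, X))  # |xy[zx]|
print(t1 == t2 == t3)
```

Exact rational arithmetic is used so that the comparison of projected vertices does not depend on floating-point rounding.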

Finally we obtain a binary code of the shape of a flat-tile sequence by
recording the ups (U) and downs (D) of the corresponding slant-tile sequence.
For example, the gray flat-tile sequence in Fig. 2(b) is encoded into the U/D
sequence

U-U-D-D-U-U-U-D-D. 

In general we need more than one drawing to encode a flat-tile sequence because
of overlaps among its peaks (Fig. 3(c)). Each drawing encodes a part of the
flat-tile sequence, and the code of the whole sequence is obtained by patching
those "local codes" together.

2.2. Encoding of two-dimensional objects. Now let us encode the two-dimensional
object shown in Fig. 3(a). First we give a flat-tile sequence which
approximates the object (Fig. 3(b)).

Then, using the encoding table Table 1(a), we obtain a binary code of the
object (Fig. 3(c)). The process proceeds as follows:
Step 1. Choose an initial value, say U, 

Step 2. By the second row of the table, the second value is U, 
Step 3. By the fourth row of the table, the third value is D, .... 

As a result we obtain the U/D sequence

(1) U-U-D-D-U-U-U-D-D-U-U-D-D-D-U-D-D-D-D.

Fig. 3(c) shows the corresponding slant-tile sequence. In this case we need two
drawings because of the overlap between the two peaks $z^2/(xy^2)$ and
$z^3/x^2$. The left drawing $\mathrm{Cone}^*\{1, z/y^2, z^2/(xy^2), z^3/x^2\}$
corresponds to the first sixteen tiles and the right drawing
$\mathrm{Cone}^*\{z^3/x^2\}$ to the last five tiles.

Figure 3. Encoding of a two-dimensional object. (a): Two-dimensional
object. (b): Approximation by a triangle sequence. (c): Two drawings
$\mathrm{Cone}^*\{1, z/y^2, z^2/(xy^2), z^3/x^2\}$ and
$\mathrm{Cone}^*\{z^3/x^2\}$ which encode approximation (b).

Table 1. Tables for two-dimensional objects. (a): Encoding table.
(b): Decoding table. (The gray tile is the current one.)



2.3. Decoding of U/D sequences in $\mathbb{R}^2$. To decode U/D sequences in
$\mathbb{R}^2$ we use the decoding table Table 1(b). For example, decoding of
the U/D sequence (1) proceeds as follows:

Step 1. Choose an initial flat-tile, say |x[yx]|, 

Step 2. By the fourth row of the table, the second flat-tile is 

Step 3. By the third row of the table, the third flat-tile is $|1[xz]|$, ....

As a result we obtain the flat-tile sequence shown in Fig. 3(b).

3. Encoding of three-dimensional objects 

If we consider unit cubes in the four-dimensional Euclidean space
$\mathbb{R}^4$, we obtain a three-dimensional drawing made up of
slant-"tetrahedron"-tiles. We approximate a three-dimensional object by a
tetrahedron sequence (Fig. 1(c)), where

(1) each tetrahedron consists of four short edges and two long edges, where the
ratio of the short edge length to the long edge length is $\sqrt{3}/2$, and

(2) successive tetrahedrons are connected via a long edge and have rotational
freedom around that edge.
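
The edge-length ratio in item (1) can be checked with a short numerical sketch. It assumes that the tetrahedron is the orthogonal projection of a slant-tile such as $v_1 v_x v_{xy} v_{xyz}$ onto the hyperplane $a + b + c + d = 0$ introduced in Section 3.1; the helper names are hypothetical.

```python
import itertools
import math


def project(v):
    """Orthogonal projection of a point of R^4 onto {a + b + c + d = 0}."""
    mean = sum(v) / len(v)
    return [c - mean for c in v]


def slant_tetrahedron(order):
    """Vertices v_1, v_u, v_uv, v_uvw for a given order of axes (0..3 = x, y, z, w)."""
    verts = [(0, 0, 0, 0)]
    for axis in order:
        nxt = list(verts[-1])
        nxt[axis] += 1
        verts.append(tuple(nxt))
    return verts


verts = slant_tetrahedron((0, 1, 2))  # the slant-tile 1[xyz]
lengths = sorted(
    math.dist(project(p), project(q))
    for p, q in itertools.combinations(verts, 2)
)
short, long_ = lengths[:4], lengths[4:]
print(short[0] / long_[0], math.sqrt(3) / 2)
```

The six pairwise distances split into four short edges and two long edges, and the printed ratio agrees with $\sqrt{3}/2$ up to rounding.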

Figure 4. Encoding of three-dimensional objects. (a): Unit cube
in $\mathbb{R}^4$ and its projection on $H$. (b): Slant-tile sequence and
flat-tile sequence defined by three peaks $P_1 = 1$, $P_2 = z^2/(x^3 y w)$, and
$P_3 = z^2/(x^2 y^2 w)$. (c): Slant-tiles over a flat-tile on $H$. (In the
figures arrows indicate the direction of "down".)



3.1. Tetrahedron sequence. Consider a unit cube in the four-dimensional
Euclidean space $\mathbb{R}^4$ whose vertices are given by $v_1$, $v_x$, $v_y$,
$v_z$, $v_w$, ..., and $v_{xyzw}$, where
$v_{x^l y^m z^n w^k} := (l, m, n, k) \in \mathbb{Z}^4$ (Fig. 4(a)). Divide each
of the four upper three-dimensional faces into six slant-tetrahedron-tiles. For
example, the face defined by $v_1$, $v_x$, $v_z$, and $v_y$ is divided into the
six slant-tiles $v_1 v_x v_{xy} v_{xyz}$, $v_1 v_y v_{yx} v_{yxz}$,
$v_1 v_y v_{yz} v_{yzx}$, $v_1 v_z v_{zy} v_{zyx}$, $v_1 v_x v_{xz} v_{xzy}$,
and $v_1 v_z v_{zx} v_{zxy}$.
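
The six slant-tiles of a face correspond to the six orderings of its three axes. A minimal sketch that enumerates them follows; the function name `tile_name` and the ASCII subscript notation `v_xy` are assumptions of this illustration.

```python
from itertools import permutations


def tile_name(order):
    """Name the slant-tile v_1 v_u v_uv v_uvw obtained by stepping along the axes in order."""
    labels = ["v_1"]
    prefix = ""
    for axis in order:
        prefix += axis
        labels.append("v_" + prefix)
    return " ".join(labels)


# One tile per ordering of the axes x, y, z of the face.
tiles = [tile_name(p) for p in permutations("xyz")]
for t in tiles:
    print(t)
```

The same enumeration over each of the four upper faces gives the full subdivision of the unit cube surface described above.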

First, by piling up these unit cubes in the direction from $v_{xyzw}$ to $v_1$,
we obtain four-dimensional "peaks and valleys" with a three-dimensional
"drawing" on it. The drawing is uniquely determined by its peaks and divides
the three-dimensional surface into a collection of slant-tetrahedron-tile
sequences. For example, the drawing of Fig. 4(b) is determined by three peaks
$P_1$, $P_2$ and $P_3$, and we denote the drawing by
$\mathrm{Cone}^*\{P_1, P_2, P_3\}$.

Second, by the projection onto the hypersurface
$H := \{(a, b, c, d) \in \mathbb{R}^4 \mid a + b + c + d = 0\}$, we obtain a
division of $H$ into a collection of flat-tetrahedron-tile sequences. For
example, Fig. 4(b) shows a slant-tile sequence and its projection onto $H$. We
write $a[uvw]$ for the slant-tile $v_a v_{au} v_{auv} v_{auvw}$ and $|a[uvw]|$
for the corresponding flat-tile. Note that there are four types of slant-tiles
over a flat-tile (Fig. 4(c)). For example,
$|1[xyz]| = |x[yzw]| = |xy[zwx]| = |xyz[wxy]|$.
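
The four names of one flat-tile can be generated symbolically. The sketch below applies the shift operator defined in the appendix to the name 1[xyz]; representing a tile as a pair of strings (prefix, bracket) is an assumption made for this illustration.

```python
VARS = "xyzw"


def shift(prefix, bracket):
    """Shift a[uvw] to au[vwr], where r is the variable missing from the bracket."""
    missing = next(v for v in VARS if v not in bracket)
    return prefix + bracket[0], bracket[1:] + missing


def name(prefix, bracket):
    """Write the tile name, using '1' for the empty prefix."""
    return f"{prefix or '1'}[{bracket}]"


tile = ("", "xyz")
orbit = []
for _ in range(len(VARS)):
    orbit.append(name(*tile))
    tile = shift(*tile)

print(" = ".join(f"|{n}|" for n in orbit))
```

After four shifts the bracket returns to xyz while the prefix has gained the factor xyzw, which is why exactly four slant-tiles lie over each flat-tile.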

Finally we obtain a binary code of the shape of a flat-tile sequence by
recording the ups (U) and downs (D) of the corresponding slant-tile sequence.
For example, the flat-tile sequence shown in Fig. 4(b) is encoded into the U/D
sequence

(2) $U^6$-$D^4$-$U^7$-D-U-D.
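
The run-length notation of sequence (2) can be expanded into an explicit U/D sequence with a small helper. The ASCII input syntax (`U^6` for six consecutive U) and the function name `expand` are assumptions of this sketch.

```python
import re


def expand(code):
    """Expand run-length notation such as 'U^6-D^4-U' into a list of 'U'/'D' symbols."""
    symbols = []
    for token in code.split("-"):
        match = re.fullmatch(r"([UD])(?:\^(\d+))?", token.strip())
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        symbols.extend(match.group(1) * int(match.group(2) or 1))
    return symbols


seq = expand("U^6-D^4-U^7-D-U-D")
print(len(seq), "-".join(seq))
```

Under this reading, sequence (2) describes twenty tiles.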

3.2. Encoding of three-dimensional objects. To encode three-dimensional objects
we use the encoding table Table 2(a). For example, encoding of the flat-tile
sequence shown in Fig. 4(b) proceeds as follows:

Step 1. Choose an initial value, say U, 

Step 2. By the second row of the table, the second value is U, 
Step 3. By the second row of the table, the third value is U, .... 

As a result we obtain the U/D sequence (2).

Table 2. Tables for three-dimensional objects. (a): Encoding
table. (b): Decoding table. (The gray tile is the current one.)



3.3. Decoding of U/D sequences in $\mathbb{R}^3$. To decode U/D sequences in
$\mathbb{R}^3$ we use the decoding table Table 2(b). For example, decoding of
the U/D sequence (2) proceeds as follows:

Step 1. Choose an initial flat-tile, say $|xw^2z^2[xzw]|$,

Step 2. By the fourth row of the table, the second flat-tile is $|xwz^2[wxz]|$,
Step 3. By the fourth row of the table, the third flat-tile is $|xwz[zwx]|$, ....

As a result we obtain the flat-tile sequence shown in Fig. 4(b).



4. Examples 

4.1. Double helix. Here let us consider the double helix shown in Fig. 5(a),
which has 12 tiles per turn. (Cf. DNA has an average of 10.9 (type A) or 10
(type B) nucleotide pairs per turn (PP).) To encode the shape of the helix, it
is enough to consider the flat-tile sequence shown in Fig. 5(b).

Using Table 2(a) with initial slant-tile $y[zxy]$, we obtain the two drawings
of Fig. 5(c). $\mathrm{Cone}^*\{P_1, P_2\}$ (left) encodes the first ten tiles,
and $\mathrm{Cone}^*\{P_1, P_3\}$ (right) encodes the last ten tiles. By
patching these local codes together, we obtain the U/D code of the helix of
Fig. 5(b):

U-U-D-D-D-D-U-U-D-D-D-D-U-U-D-D.



Figure 5. Double helix. (a): Double helix formed by two tetrahedron
sequences. (b): Part of the helix. (c): Slant-tile sequence and flat-tile
sequence defined by $P_1 = 1$, $P_2 = y^2 z/x$, and $P_3 = y^2 w^2$.

4.2. 2HIU chain A (Insulin, human). Next let us consider the three-dimensional
structure of 2HIU chain A (Fig. 1). Using Table 2(a) with initial slant-tile
$zw[xyz]$, we obtain eight drawings:

- $\mathrm{Cone}^*\{z/y,\ 1/(x^2 w),\ 1/(x^2 z)\}$ for [1,14],

- $\mathrm{Cone}^*\{1/(xy),\ 1/(x^2 w),\ 1/(x^2 z)\}$ for [7,18],

- $\mathrm{Cone}^*\{1/(xy),\ 1/(x^3 z w),\ 1/(x^3 z^2),\ w/(xyz)\}$ for [13,29],

- $\mathrm{Cone}^*\{1/(xyz^2),\ 1/(x^3 z w),\ 1/(x^3 z^2),\ xw/y^2\}$ for [16,42],

- $\mathrm{Cone}^*\{xw^2/(yz),\ w/y,\ xw/y^2\}$ for [36,45],

- $\mathrm{Cone}^*\{xw^2/(yz),\ 1/y^2,\ x/y^3\}$ for [40,51],

- $\mathrm{Cone}^*\{x/(y^4 z),\ 1/y^2,\ x/(y^4 w)\}$ for [45,57],

- $\mathrm{Cone}^*\{x/(y^4 z),\ 1/(y^4 w^2)\}$ for [52,63].

([n,m] denotes the part of the sequence from the n-th tile to the m-th tile.)
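
Before patching local codes, it may be useful to confirm that the tile ranges of the drawings cover the whole sequence and that consecutive ranges overlap. The sketch below does this for the eight ranges listed above; the helper names `covers` and `overlap_sizes` are hypothetical, and the sketch says nothing about how the local codes themselves are merged.

```python
ranges = [(1, 14), (7, 18), (13, 29), (16, 42),
          (36, 45), (40, 51), (45, 57), (52, 63)]


def covers(ranges, first, last):
    """True if the union of the inclusive ranges is exactly first..last."""
    covered = set()
    for lo, hi in ranges:
        covered.update(range(lo, hi + 1))
    return covered == set(range(first, last + 1))


def overlap_sizes(ranges):
    """Number of tiles shared by each pair of consecutive ranges."""
    return [
        min(a_hi, b_hi) - b_lo + 1
        for (_, a_hi), (b_lo, b_hi) in zip(ranges, ranges[1:])
    ]


print(covers(ranges, 1, 63))
print(overlap_sizes(ranges))
```

The ranges cover tiles 1 to 63, which is consistent with three tetrahedrons per amino acid for a chain of 21 residues, and every consecutive pair of drawings shares at least six tiles.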

By patching these local codes together, we obtain the U/D code of the
three-dimensional structure of the protein (Fig. 1(c)):

Table 3 shows the correspondence between the U/D code and the amino-acid
sequence of the protein. (See also Fig. 1(c) and (d).)
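
The digits used for the U/D code in Table 3 abbreviate triples of U/D values, one triple per amino acid. The sketch below assumes that "0 denotes D-D-D, 1 denotes D-D-U and so on" means a three-bit binary expansion with D = 0 and U = 1; the function name `digit_to_ud` is hypothetical.

```python
def digit_to_ud(digit):
    """Expand a digit 0..7 into a U/D triple, reading it as three bits (D = 0, U = 1)."""
    if not 0 <= digit <= 7:
        raise ValueError("digit must be between 0 and 7")
    bits = format(digit, "03b")
    return "-".join("U" if b == "1" else "D" for b in bits)


for d in range(8):
    print(d, digit_to_ud(d))
```

Under this assumption a digit 7 stands for three consecutive ups and a digit 0 for three consecutive downs.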

Table 3. U/D code and the amino-acid sequence of 2HIU chain
A. (0 denotes D-D-D, 1 denotes D-D-U and so on.)

No.          1    2    3    4    5    6    7    8
Amino-acid   GLY  ILE  VAL  GLU  GLN  CYS  CYS  THR
U/D code     7    3    6    3    1    3    6    3

No.          9    10   11   12   13   14   15   16
Amino-acid

Appendix A. Differential geometry of N-hedron tiles

A.1. Space of N-hedron tiles. Let $L_N^*$ be the collection of all integer
points of the N-dimensional Euclidean space $\mathbb{R}^N$:

\[ L_N^* := \{ x_1^{l_1} x_2^{l_2} \cdots x_N^{l_N} \mid l_i \in \mathbb{Z} \text{ for all } i \}. \]

Consider the collection $S$ of all "slant" N-hedrons defined by $L_N^*$:

\[ S := \{ a[x_{p(1)} \cdots x_{p(N-1)}] \mid a \in L_N^*,\ p \in \mathcal{S}_N \}, \]

where $\mathcal{S}_N$ is the N-th symmetric group and
$a[x_{p(1)} \cdots x_{p(N-1)}]$ denotes the convex hull
$\mathrm{conv}[a_0, a_1, \ldots, a_{N-1}]$ of the $N$ points $a_0 = a$,
$a_1 = a x_{p(1)}$, ..., $a_{N-1} = a x_{p(1)} x_{p(2)} \cdots x_{p(N-1)}$ in
$\mathbb{R}^N$:

\[ a[x_{p(1)} \cdots x_{p(N-1)}] := \Big\{ \sum_{0 \le i < N} \lambda_i a_i \ \Big|\ 0 \le \lambda_i \in \mathbb{R} \text{ s.t. } \sum_{0 \le i < N} \lambda_i = 1 \Big\}. \]

The collection $B$ of all "flat" N-hedrons is defined as the quotient of $S$ by
the "shift operator" $\sigma$ on $S$ (Fig. 6(a)). That is, $B := S/\sigma$,
where

\[ \sigma\big( a[x_{p(1)} \cdots x_{p(N-1)}] \big) := a x_{p(1)} [x_{p(2)} \cdots x_{p(N)}]. \]

A.2. Differential structure on B. The "tangent bundle" $T[B]$ on $B$ is defined
as the quotient of $S$ by $\sigma^N$:

\[ T[B] := S/\sigma^N, \]

\[ \pi : T[B] \to B, \qquad \pi(s \bmod \sigma^N) := s \bmod \sigma. \]

We identify $T[B]$ with $B \times \{e/x_1, e/x_2, \ldots, e/x_N\}$
($e = x_1 x_2 \cdots x_N$) by the one-to-one correspondence

\[ s \bmod \sigma^N \sim (s \bmod \sigma,\ Ds), \]

where the "gradient" $Ds$ of $s \in S$ is defined by

\[ D\,a[x_{p(1)} \cdots x_{p(N-1)}] := x_{p(1)} \cdots x_{p(N-1)} = e/x_{p(N)}. \]

Figure 6. Differential geometry of 3-hedron tiles. (a): Fiber of
$S$ over a point of $B$. (b): The local trajectory specified by $s \in S$.
(c): The second derivative along orbit $\{t[i]\}$.

Let $s = a[x_{p(1)} \cdots x_{p(N-1)}] \in S$. Then
$s \bmod \sigma^N \in T[B]$ specifies the "local trajectory"
$\{s_U \bmod \sigma,\ s \bmod \sigma,\ s_D \bmod \sigma\}$ at
$s \bmod \sigma \in B$ (Fig. 6(b)), where

\[ s_U := a[x_{p(1)} \cdots x_{p(N-2)} x_{p(N)}], \]
\[ s_D := a x_{p(1)} [x_{p(2)} \cdots x_{p(N-1)} x_{p(1)}]. \]

We obtain a flow on $B$ by patching these local trajectories together.
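
The operators of this appendix can be sketched on a concrete data structure. Below, a slant N-hedron is stored as a pair (a, p): a is the exponent vector of the base point and p is a permutation of the axis indices whose first N-1 entries form the bracket. This representation and all function names are assumptions of the illustration; the gradient is recorded simply as the index of the missing variable $x_{p(N)}$.

```python
def vertices(a, p):
    """Vertices a_0, ..., a_{N-1} of the slant N-hedron a[x_p(1) ... x_p(N-1)]."""
    verts = [tuple(a)]
    for axis in p[:-1]:
        nxt = list(verts[-1])
        nxt[axis] += 1
        verts.append(tuple(nxt))
    return verts


def shift(a, p):
    """Shift operator: a[x_p(1)...x_p(N-1)] -> a*x_p(1) [x_p(2)...x_p(N)]."""
    b = list(a)
    b[p[0]] += 1
    return tuple(b), p[1:] + p[:1]


def gradient(a, p):
    """Gradient Ds = e / x_p(N), stored as the index of the missing variable."""
    return p[-1]


def s_up(a, p):
    """s_U: replace the last bracket variable x_p(N-1) by x_p(N)."""
    return tuple(a), p[:-2] + (p[-1], p[-2])


def s_down(a, p):
    """s_D: move the base point by x_p(1) and append x_p(1) to the bracket."""
    b = list(a)
    b[p[0]] += 1
    return tuple(b), p[1:-1] + (p[0], p[-1])


s = ((0, 0, 0, 0), (0, 1, 2, 3))  # the slant-tile 1[xyz] for N = 4
shared_up = set(vertices(*s)) & set(vertices(*s_up(*s)))
shared_down = set(vertices(*s)) & set(vertices(*s_down(*s)))
print(len(shared_up), len(shared_down))
```

In this example both $s_U$ and $s_D$ share three vertices with $s$, so each neighbor on the local trajectory meets $s$ along a common face.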

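A.3. Cones and their boundary surfaces. Let
$\mathrm{PHN}_N := \{\mathrm{Cone}^* A \mid A \subset L_N^*\}$, where

\[ \mathrm{Cone}^* A := \{ p\, x_1^{l_1} x_2^{l_2} \cdots x_N^{l_N} \in L_N^* \mid p \in A \text{ and } 0 \le l_i \in \mathbb{Z} \text{ for all } i \}. \]

That is, $\mathrm{PHN}_N$ is the collection of all "cones" defined by $L_N^*$.
We denote the "boundary surfaces" of $w \in \mathrm{PHN}_N$ by $\partial_S w$:

\[ \partial_S w := \{ \mathrm{conv}[a_0, a_1, \ldots, a_{N-1}] \in S \mid l_w(a_i) = 0 \text{ for all } i \}, \]

where
$l_w(z) := \max_{p \in w} \big[ \min_{1 \le i \le N} \{ l_i \in \mathbb{Z} \mid \prod_{1 \le i \le N} x_i^{l_i} = z/p \} \big]$
for $z \in L_N^*$.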
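
The boundary condition can be tried out on the two-dimensional examples of Section 2. The sketch below rests on a reading of the definitions that should be treated as an assumption: $l_w(z)$ is taken to be the maximum over the peaks $p$ of the smallest exponent of $z/p$, and a slant-tile is taken to lie on the boundary when $l_w$ vanishes at all of its vertices. The maximum is taken over the generating peaks only, and the function names are hypothetical.

```python
def l_w(z, peaks):
    """Assumed reading of l_w: max over peaks p of the smallest exponent of z/p."""
    return max(min(zi - pi for zi, pi in zip(z, p)) for p in peaks)


def on_boundary(tile_vertices, peaks):
    """True if l_w vanishes at every vertex of the tile."""
    return all(l_w(v, peaks) == 0 for v in tile_vertices)


single_peak = [(0, 0, 0)]            # Cone*{1}
two_peaks = [(2, 0, -1), (0, 0, 0)]  # Cone*{x^2/z, 1}

tile_1xy = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]  # slant-tile 1[xy]
tile_xyz = [(1, 0, 0), (1, 1, 0), (1, 1, 1)]  # slant-tile x[yz]

print(on_boundary(tile_1xy, single_peak), on_boundary(tile_xyz, single_peak))
print(l_w((2, 0, -1), two_peaks), l_w((0, 0, -1), two_peaks))
```

Under this reading the upper-face tile 1[xy] lies on the boundary of Cone*{1}, whereas x[yz], which lies over the same flat-tile, does not because its vertex $v_{xyz}$ is in the interior of the cone.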

The boundary surfaces of a cone induce a vector field on S. 

A.4. Vector field on B. Let $w \in \mathrm{PHN}_N$. Then $\partial_S w$
specifies a unique N-hedron $s \in \partial_S w$ over each $t \in B$, which we
denote by $\Gamma_w(t)$:

\[ \Gamma_w(t) := \text{the unique N-hedron } s \in \partial_S w \text{ s.t. } t = s \bmod \sigma. \]

$\Gamma_w$ induces the vector field $X_w$ over $B$:

\[ X_w(s \bmod \sigma) := D\,\Gamma_w(s \bmod \sigma). \]

Let $\{t[i]\} \subset B$ be a trajectory defined by the vector field $X_w$. We
define the "second derivative" $D^2\Gamma_w(t[i])$ of $\Gamma_w$ along
$\{t[i]\}$ as a $\{U, D\}$-valued function by

\[
D^2\Gamma_w(t[i+1]) :=
\begin{cases}
D^2\Gamma_w(t[i]) & \text{if } X_w(t[i+1]) = X_w(t[i]), \\
-D^2\Gamma_w(t[i]) & \text{else,}
\end{cases}
\]

where $-D := U$ and $-U := D$ (Fig. 6(c)).
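
The recursion for the second derivative amounts to flipping the current U/D value whenever the gradient changes along the trajectory. A minimal sketch follows; the gradient sequence in the example is hypothetical, and the initial value is a free choice, as in Step 1 of the encoding procedures.

```python
def second_derivative(gradients, initial="U"):
    """U/D code of a trajectory: keep the value while the gradient is unchanged, flip otherwise."""
    flip = {"U": "D", "D": "U"}
    code = [initial]
    for previous, current in zip(gradients, gradients[1:]):
        code.append(code[-1] if current == previous else flip[code[-1]])
    return code


# Hypothetical gradients along a trajectory, written as the missing variable of e/x.
gradients = ["x", "x", "y", "y", "y", "z"]
code = second_derivative(gradients)
print("-".join(code))
```

Choosing the other initial value exchanges U and D throughout, so the code is determined by the trajectory only up to this global flip.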

Then we can encode the $(N-1)$-dimensional structure of any trajectory by the
second derivative along the trajectory, i.e., a U/D sequence.